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Abstract: In this paper, we study the problem of high-dimensional ap- 
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dimension and that does not require imputation of the missing data. We 
establish non-asymptotic sparsity oracle inequalities for the estimation of 
the covariance matrix with the Frobenius and spectral norms, valid for any 
setting of the sample size and the dimension of the observations. We fur- 
ther establish minimax lower bounds showing that our rates are minimax 
optimal up to a logarithmic factor. 
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1. Introduction 

Let X,X\, . . . ,X n £ K p be i.i.d. zero mean vectors with unknown covariance 
matrix S — 1KX <g> X. Our objective is to estimate the unknown covariance 
matrix £ when the vectors X\, . . . ,X n are partially observed, that is, when 
some of their components are not observed. More precisely, we consider the 
following framework. Denote by X\ the j-th component of the vector Xi. We 
assume that each component x!- is observed independently of the others with 
probability S £ (0,1]. Note that 5 can be easily estimated by the proportion 
of observed entries. Therefore, we will assume in this paper that S is known. 
Note also that the case 5 = 1 corresponds to the standard case of fully observed 
vectors. Let (<5i J )i<i<n,i<j<p be a sequence of i.i.d. Bernoulli random variables 
with parameter S and independent from X\, . . . , X n . We observe n i.i.d. random 
vectors Y\ , . . . , Y n £ M. p whose components satisfy 

= SijX^ , 1 < i < n, 1 < j < p. (1.1) 

We can think of the Sij as masked variables. If Si j = 0, then we cannot observe 
the j-th component of Xi and the default value is assigned to Y i . Our goal 
is then to estimate S given the partial observations Y±, . . . ,Y n . 
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The statistical problem of covariancc estimation with missing observations 
is fundamental in multivariate statistics since it is often used as the first step 
to retrieve information in numerous applications where datasets with missing 
observations are common: 

1. Climate studies: n is the number of time points and p the number of obser- 
vations stations, which may sometimes fail to produce an observation due 
to break down of measure instruments. As a consequence, the generated 
datasets usually contain missing values. 

2. Gene expression micro-arrays: n is the number of measurements and p 
the number of tested genes. Despite the improvement of genes expression 
techniques, the generated datasets frequently contain missing values with 
up to 90% of genes affected. 

3. Cosmology: n is the number of images produced by a telescope and p is the 
number of pixels per image. With the development of very large telescopes 
and wide sky surveys, the generated datasets are huge but usually contain 
missing observations due to partial sky coverage or defective pixel. 

One simple strategy to deal with missing data is to exclude from the analysis 
any variable for which observations are missing, thus restricting the analysis to 
a subset of fully observed variables. In gene expression data where 90% of the 
genes are affected by missing values, we would be left with too few variables 
so that the legitimacy of the statistical analysis becomes questionable. Also, 
discarding variables with very few missing observations is a waste of available 
information. Existing procedures involve complex imputation techniques to fill 
in the missing values through computationally intensive implementation of the 
EM algorithm, see [33] and the references cited therein for more details. In 
this paper, we propose a simple procedure computationally tractable in high- 
dimension that does not require to imput missing observations or to discard any 
available observation to recover the covariance matrix S. 

Contemporary datasets are huge with both large sample size n and dimen- 
sion p and typically p n. Consequently, a question of considerable prac- 
tical interest is to perform dimension reduction, that is finding a good low- 
dimcnsional approximation for these huge datasets. This recent paradigm where 
high-dimensional objects of interest admit in fact a small intrinsic dimension 
has produced spectacular results in several fields, such as compressed sens- 
ing where it is possible to recover s-sparse vectors of dimension p with only 
n = O (s\og(ep/s)) measurements provided these measurements are carried out 
properly, see [4, 9, 13, 22, 24, 26] and the references cited therein for more details. 
An analogous result holds in matrix completion where recovery of a low-rank 
matrix A £ W xp via nuclear norm minimization is possible with as few as 
O (prlog 2 p) observed entries where r is the rank of A, provided the matrix of 
interest A satisfies some incoherence condition, see [10-12, 17, 20, 23, 25, 29, 30] 
for more details. See also [5, 21] for rank minimization approach. A popular di- 
mension reduction technique for covariance matrices is Principal Component 
Analysis (PCA), which exploits the spectrum of the sample covariance matrix. 
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In the high-dimensional setting, [18] showed that the standard PCA procedure 
is bound to fail since the sample covariance spectrum is too spread out. 

Several alternatives have been studied in the literature to provide better 
estimates of the covariance matrix in the high-dimensional setting. A popular 
approach in Gaussian graphical models consists in estimating the inverse of 
the covariance matrix (called concentration matrix) since it admits a naturally 
sparse (or approximately sparse) structure if the dependence graph is itself 
sparse. See [2, 7, 16, 27, 28, 38] and the references cited therein for more details. 
A limitation of this approach is that it does not apply to low rank matrices 
£ since the concentration matrix does not exist in this case. An other popular 
approach assumes that the unknown covariance matrix is sparse, that is most 
of the entries are exactly or approximately zero and then proposes to perform 
either entrywise thresholding or tapering of the sample covariance matrix [3, 
6, 8, 14, 31, 32]. Note that the sparsity notion adopted in this approach is not 
adapted to strongly correlated datasets with dense covariance matrix. 

In random matrix theory, an important line of work, [15, 18, 19] and the 
references cited therein studied the asymptotic distribution of the sample co- 
variance matrix eigenvalues for different settings of n and p. See also [36] for a 
very nice survey of existing non-asymptotic results on the spectral norm devi- 
ation of the sample covariance matrix from its population counterpart. In this 
paper, we adopt this approach and we will provide further details as we present 
our results. 

Note that the results derived in the works cited above do not cover datasets 
with missing observations. For instance, when the data contains no missing 
observation (S = 1), [36] established a non-asymptotic control on the stochastic 
deviation ||I]„ — £||oo of the empirical covariance matrix £„ = — X^ILi-^i ® 
Xi provided some tails conditions are satisfied by the common distribution of 
X\, . . . , X n . Exploiting these results, it is possible to establish oracle inequalities 
for the covariance version of the matrix Lasso estimator 



where S p is the set of p x p positivc-semidcfinite symmetric matrices, \\S\\2 
and ||<S||i are respectively the Frobenius and nuclear norm of S and A > is 
a rcgularization parameter that should be chosen of the order of magnitude 
of || E„ — S||oo. This estimator is the covariance version of the matrix Lasso 
estimator initially introduced in the matrix regression framework, see [25, 30] 
and the references cited therein. To the best of our knowledge, the procedure 
(1.2) has not been studied in the covariance estimation problem. 

When the data contains missing observations (5 < 1), we no longer have 
access to S n . Given the observations Y±,...,Y n , we can build the following 
empirical covariance matrix 



argminggg \\^-"n S||i + A||5||i, 



(1.2) 
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In this case, a naive approach to derive oracle inequalities consists in computing 
the matrix Lasso estimator (1.2) with E„ replaced by Erf. Unfortunately this 
approach is bound to fail since is not a good estimator of £ when 5 < 1. 
Indeed, some elementary algebra gives that E (s^) = E W with 

E^ = (5 - (5 2 )diag(E) + 5 2 £, 

where diag(E) is the p x p diagonal matrix obtained by putting all the non- 
diagonal entries of E to zero. When 5 = 1, we see that E^ 1 - 1 = E and E^ = E n . 
However, when observations are missing (5 < 1), E^' can be very far from E. 
Hence, E^ will be a poor estimator of E since it concentrates around its mean 
E^* 1 under suitable tail conditions on the distribution of X. Consequently, the 
stochastic deviation 1 1 £„ — E 1 1 oo will be too large and the matrix Lasso estimator 
(1.2) with £„ replaced by Ti n S ^ , which requires A to be of the order of magnitude 
of ||E„ — E||oo, will perform poorly since its rate of estimation grows with A. 

We present now our reconstruction procedure based on the following simple 
observation 

E = (5- 1 - (T 2 )diag (E (5) ) +S- 2 X {S \ V0<S<1. (1.3) 

Therefore, we can define the following unbiased estimator of E when the data 
set contains missing observations 

E„ = (S- 1 - <T 2 )diag (S«) + <r 2 £^. (1.4) 

Our estimator is then solution of the following penalized empirical risk mini- 
mization problem: 



E A = argmin ses 



S 



+ X\\S\\ 1 , (1.5) 

2 



where A > is a regularization parameter to be tuned properly. We note that 
this simple procedure can be computed efficiently in high-dimension since E A 
is solution of a convex minimization problem. The optimal choice of the tuning 
parameter A is of the order of magnitude of the stochastic deviation ||E n — E||oo. 
Therefore, in order to order to establish sharp oracle inequalities for (1.5), we 
need first to study the deviations of ||E„ — E||oo. This analysis is more difficult 
as compared to the study of ||E n — E||oo since we need to derive the sharp scaling 
of ||E„ - E||oo with S. 

The rest of the paper is organized as follows. In Section 2, we recall some tools 
and definitions. In Section 3, we establish oracle inequalities for the Frobenius 
and spectral norms for our procedure (1.5) and also propose a data-driven choice 
of the regularization parameter. In section 4, wc establish minimax lower bounds 
for data with missing observations 5 G (0, 1], thus showing that our procedures 
are minimax optimal up to a logarithmic factor. Finally, Section 5 contains all 
the proofs of the paper. 
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We emphasize that the results of this paper are non-asymptotic in nature, 
hold true for any setting of n,p, are minimax optimal (up to a logarithmic 
factor) and do not require the unknown covariance matrix £ to be low-rank. 
We note also that to the best of our knowledge, there exists in the literature no 
minimax lower bound result for statistical problem with missing observations. 



2. Tools and definitions 

The Z g -norms of a vector x = (x^\ • • ■ , a;( p )) T £ W is given by 




, for 1 < q < oo, and |a;|oo = max 

i<j<p 

Denote by S p the set of p x p symmetric positivc-scmidcfinitc matrices. Any 
matrix A £ S p admits the following spectral representation 

r 

A = Y^ <T i( A )»M)®u j (A) 
j=i 

where r = rank(A) is the rank of A, <Ji(A) > (72(A) > • • • > a r {A) > are 
the nonzero eigenvalues of A and ux(A), . . . , u r (A) £ W are the associated 
orthonormal eigenvectors (we also set a r +i(A) = ■ ■ ■ = <J p {A) = 0). The linear 
vector space L is the linear span of {ui(A), . . . , u r (A)} and is called support of 
A. We will denote respectively by Pl and P^ the orthogonal projections onto 
L and L . 

The Schatten g-norm of A £ S p is defined by 

/ \ 1/3 

\\A\\ q = rpJa^A^J ,forl< g <oo, and WA^ = a^A), 

Note that the trace of any S £ S p satisfies tr(5) = I^Hi- 
Recall the trace duality property: 

\tr{A T B)\ < PllxllBlloo, VA,5e M^. 

We will also use the fact that the subdifferential of the convex function A n- 
||A||i is the following set of matrices : 

r 

d\\A\\ l = {Y J U ] {A)®u ] (A) + PtWPt: WU < l}, (2.1) 

(cf. [37]). 

We recall now the definition and some basic properties of sub-exponential 
random vectors. 
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Definition 1. The ip a -norms of a real-valued random variable V are defined by 

\\V\\^ a = inf {u > : E cxp (\V\ a /u a ) < 2} , a > 1. 

We say that a random variable V with values in R is sub- exponential if \\V\\ip a < 
oo for some a > 1. If a = 2, we say that V is sub-gaussian. 

We recall some well-known properties of sub-exponential random variables: 

1. For any real- valued random variable V such that ||V|| Q < oo for some 
a > 1, we have 

supm-iE(|yn 1/m <C\\VU a , 

m>l 

where C > can depend only on a. 

2. If a real-valued random variable V is sub-gaussian, then V 2 is sub-exponential. 
Indeed, we have 

\\V 2 U 1 <2\\V\\l 2 . 

Definition 2. A random vector X € R p is sub- exponential if (X,x) are sub- 
exponential random variables for all x G W . The ip a -norms of a random vector 
X are defined by 



\X\\i, a = sup ||(X,a:)||^ 

x£Rp-.\x\ 2 = 1 



a > 1. 



We recall the Bernstein inequality for sub-exponential real-valued random 
variables (see for instance Corollary 5.17 in [36]) 

Proposition 1. Let Y±, . . . , Y n be independent centered sub- exponential random 
variables, and K = max.; ||5^ H^. Then for every t > 0, we have with probability 
at least 1 — e _t 



1 n 

71 ^ ' 



< CK 



t t 
V - 

n n 



where C > is an absolute constant. 



The following proposition is the matrix version of Bernstein's inequality for 
bounded random matrices [1] (see also Corollary 9.1 in [34]). 

Proposition 2. Let Z\, . . . , Z n be symmetric independent random matrices in 
M. pxp that satisfy E(^i) = and ||^||oc < U almost surely for some constant U 
and all i = 1, . . . , n. Define 



crz 



i 1 n 

- V ez; 



Then, for all t > 0, with probability at least 1 



we have 



Zi 



Z n 



/ t + log(2p) TT t + \og{2p) \ 
< 2 max \ oz \j , U > 
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3. Oracle inequalities 



We can now state the main result for the procedure (1.5). 

Theorem 1. Let X\, . . . ,X n be i.i.d. vectors in W with covariance matrix E. 
For any p > 2,n > 1, we have on the event A > 2||E„ — E||oo 



< inf 

2 SGS„ 



\s-n 



mini 2A||5||i 



(1 + V2) 2 



A 2 rank(S') 



< A. 



(3.1) 



(3.2) 



As we see in Theorem 1, the regularization parameter A should be chosen 
sufficiently large such that the condition A > 2||E„ — E||oo holds with probability 
close to 1. The optimal choice of A depends on the unknown distribution of the 
observations. We consider now the case of sub-gaussian random vector XeP. 

Assumption 1 (Sub-gaussian observations). The random vector X G W is 
sub-gaussian, that is \\X\\^ 2 < oo. In addition, there exist a numerical constant 
ci > such that 



E((X,u)Y > Cl ||(X, M )|K , Vue 



(3.3) 



Note that Gaussian distributions satisfy Assumption 1. Under the above con- 
dition, we can study the stochastic quantity ||E„ — E||oo and thus properly tune 
the regularization parameter A. 

The intrinsic dimension of the matrix E can be measured by the effective 
rank 

tr(E) 



r(E) := 



(3.4) 



see Section 5.4.3 in [36]. Note that we always have r(E) < rank(E). In addition, 
we can possibly have r(E) <C rank(E) for approximately low-rank matrices E, 
that is matrices E with large rank but concentrated around a low-dimensional 
subspace. Consider for instance the covariance matrix E with eigenvalues g\ = 1 
and cr 2 = • • • = o- p = 1/p, then r(E) = < p = rank(E) 

We have the following result, which requires no condition on the covariance 
matrix E. 

Proposition 3. Let X\, . . . ,X n G W be i.i.d. random vectors satisfying As- 
sumption 1. Let Y\,...,Y n be defined in (1.1) with 8 G (0,1]. Then, for any 
t > 0, we have with probability at least 1 — e~* 



< C- 



/r(E)(i + log(2p)) r(E) (t + log(2p)) 



Cl 



8h 



5 2 i 



(ci5 + t + logn) 
(3.5) 
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|tr(E n ) - tr(E)| < maxid-,-} 
c\d I V n n 

where C > is an absolute constant. 

1. The natural choice for t is of the order of magnitude log(2p). Then the 
conclusions of Proposition 3 hold true with probability at least 1 — ^ . In 
addition, if the number of measurements n is sufficiently large 

n>c^plog 2 ((2p)Vn), (3.7) 

where c > is a sufficiently large numerical constant, then an acceptable 
choice for the regularization parameter A is 



A ^HkJ«M, (3 .8) 
ci V on 

where the absolute constant C > is sufficiently large. 

2. As we claimed in the introduction, Proposition 3 requires no condition on 
E whatsoever. However, for the result to be of any practical interest, we 
need the bound in (3.5) to be small, which is the case if the condition (3.7) 
is satisfied. This condition is interesting since it shows that the number 
of measurements sufficient to guarantee a precise enough estimation of 
the spectrum of E grows with the effective rank r(E). In particular, when 
no observation is missing (8 = 1), if E is approximately low-rank so that 
r(E) <C p, then only n = O (r(E) log 2 (2p)) measurements are sufficient to 
estimate precisely the spectrum of the p x p covariance matrix E. 

3. Note that if we assume that ||V <8>y||oo = \Y\ 2 5= U a.s. for some constant 
U > 0, then we can eliminate the (c\8 + t + logn) factor in (3.5). Conse- 
quently, we can replace the condition (3.7) on the number of measurements 
by the following less restrictive one 

n>c-^log(2p), 

for some absolute constant c > sufficiently large. When there is no miss- 
ing observation (6 = 1), we obtain the standard condition on the number 
of measurements (see Remark 5.53 in [36]). When some observations are 
missing (S < 1), we have the additional quantity S 2 in the denominators 
of (3.5) and (3.7). The bound (3.5) is degraded in the case S < 1 since 
we observe less entries per measurement. Consequently, as we can see it 
in (3.7), if we denote by N(e) the number of necessary measurements to 
estimate E with a precision e when no observation is missing (8 = 1), then 
we will need at least O (iV(e)/<5 2 ) measurements in order to estimate E 
with the same precision e when some observations are missing (8 < 1). In 
Theorem 2, we prove in particular that the dependence of the bound (3.5) 
on 8 is sharp by establishing a minimax lower bound. 
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4. In the full observations case (5 = 1) and for sub-gaussian distributions 
with low rank covariance matrix S, a simple modification of the e-net 
argument used in [36] to prove Theorem 5.39 yields an inequality similar 



to (3.5) with an upper bound of the order ||E||oo y lank ^ s ) +t without any 
logarithmic factor log2p. Note however that this bound is suboptimal 
when r(S) log 2 ((2p) Vn)« rank(S) (cf the discussion below Assumption 
1 on the intrinsic dimension of a matrix). In addition, in the missing 
observations framework 8 < 1, the matrix £W can have full rank even 
if the matrix £ is low rank. Therefore the e-nct argument will yield an 

upper bound of the order || E|| oo v/ which is much larger than the bound 
derived in (3.5). 

5. Proposition 3 and Equation (3.8) give some insight on the tuning of the 
rcgularization parameter: 



A _ c V /tr(S)||Sl| 00 / log(2^ 
c\8 V n ' 

where C > is a sufficiently large absolute constant. We see that this 
choice of A depends on tr(S) and HSH^ which are typically unknown. 
Therefore we propose to use instead 



A = c /t,(E )||E„| U /5» 
5 V n 

where C > is a large enough constant. Note that the above choice of A 
does not depend on the unknown quantities ||S||oo or tr(S) and constitutes 
thus an interesting choice in practice. We prove in the next lemma that 
2||S„ — E||oo < A with probability at least 1 — 

Lemma 1. Let the assumptions of Proposition 3 be satisfied. Assume in addition 
that (3.7) holds true. Take A as in (3.9) with C > a large enough constant 
that can depend only on C\. Then, we have with probability at least 1 — ^ that 



2||Sn ~ Slice < A < ClISIloo 1/ *) , 

V on 

where C > can depend only on c\ . 

We obtain the following corollary of Theorem 1 . 

Corollary 1. Let Assumption 1 be satisfied. Assume that (3.7) is satisfied. 
Consider the estimator (1.5) with the regularization parameter A satisfying (3.9). 
Then we have, with probability at least 1 — i that 

^ ^112/ : „ f fn^ CII2 , ^11^1,2 r(£)lo g 2;p 



|E A - S|| 2 < mf | ||S - S\\ 2 2 + dllSH^ > ^ " rank(g) } , (3.10) 
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||^-S|U<C 2 ||E||^£i5l^, (3.11) 

where C\ , Ci > can depend only on c\ . 

The proof of this corollary is immediate by combining Theorem 1 with Propo- 
sition 3 and Lemma 1. 



4. Lower bounds 

For any integer 1 < r < p, define 

C r = {SeS p : r(S) < r} . 

We also introduce V r the class of probability distributions on W with covariancc 
matrix E £ C r . 

We now establish a minimax lower bound that guarantees the rates we ob- 
tained in Corollary 1 are optimal up to a logarithmic factor on the probability 
distribution class V r . In particular, the dependence of our rates on S, HEHoo and 
r(S) is sharp. 

Theorem 2. Fix 6 G (0, 1]. Let n,r > 1 be integers such that n > 6~ 2 r 2 . Let 
X\, . . . ,X n be i.i.d. random vectors in R p with covariance matrix E G C r . We 
observe n i.i.d. random vectors Y\, . . . , Y n G W such that 

Y i=Sii X i j \ l<i<n,l<j<p, 

where (Sij)i<i<n,l<j<p is an i.i.d. sequence of Bernoulli B(S) random variables 
independent of X\ , . . . , X n . 

Then, there exist absolute constants (3 G (0, 1) and c > such that 

inf sup P s f||E-S|||> C ||E||^^rank(E) N ) > /3, (4.1) 

and 



inf sup P s ( ||E-S||oo >c||E|| 00 -i/^ j > (3, (4.2) 

where inf a denotes the infimum over all possible estimators E of E based on 
Y\ , . . . , Y n . 

5. Proofs 

5.1. Proof of Theorem 1 

The proof of the first inequality adapts to covariance matrix estimation the 
arguments used in the trace regression problem to prove Theorems 1 and 11 in 
[25]. 
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PROOF. By definition of E A , we have for any S £ S p 

||E A -E|||< \\S - £||2 + A||5||i + 2(E - E n , S - E A ) - A||E A ||i. 

If A > 2||E„ — EHoo, we deduce from the previous display that 

||E A -E||2< ||S , -E||2 + 2A||S'||i, VS£S P . 

Next, a necessary and sufficient condition of minimum for problem (1.5) implies 
that there exists V £ 9||E A ||i such that for all S £ S p 

- 2(E„ - E A , E A - S) + \{V, E A - S) < 0. (5.1) 

For any S £ S p of rank r with spectral representation S = <7 j u j ® u j an< ^ 

support L, It follows from (5.1) that 

2(E A - E, E A - S) + X(V -V,± x -S)< -X(V, E A - S) + 2(E„ - E, E A - S), 

(5.2) 

for an arbitrary V £ <9||S||i. Note that (V - V, E A - S) > by monotonicity of 
subdiffcrcntials of convex functions and that the following representation holds 

r 

V = Y,u j ®u j +PtWPt, 

3=1 

where W is an arbitrary matrix with ||W||oo < 1- hi particular, there exists W 
with 1 1 W | |oo < 1 such that 

(Ptwpt^ -s) = \\Pt£ x Pt\\i- 

For this choice of W, we get from (5.2) that 

||E A - E||2+||E A -Sf 2 + A||P I L E A P I L i| 1 < ||5 - E||| 

+ A||P L (E A - S)P L \\ 1 + 2(E„ - E, E A - S), (5.3) 

where we have used the following facts 

2(E A -E,E A -S) = ||E A -Ej|2 + j|S A -5||2-||5-E||2, 

and 

r 

Uj Cg) U 

3 = 1 

For any A £ W x p define V L {A) = A- P£AP£. Set Ai = E„ - E. We have 
(Ai, E A - S) = (A 1 ,P i (E A - S)) + (A U P£ (S A - S)P£). 



E 



Uj <8>u 7 -,E A - 5" 



®u 3l P L {t x -S)P L 
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Using Cauchy-Schwarz's inequality and trace duality, we get 



|(Ai,-Pi(S A - S))| < ^rank^HAilUlE* - S|j 2 , 



\\P L {t x - S)P L \\x < v/rank(5)||S A - S|| 2 , 
\(A 1 ,Pt(± x -S)Pt)\ < IIAxl^HP^P^Il!. 

The above display combined with (5.3) give 

||£ A - £||1+||E* - + (A - 2||A 1 || 00 )||^£ A P i ± || 1 < ||S- S||l 

+ (>/2||Ai|| 00 + X)^\\± x - S\\ 2 

A decoupling argument gives 
||E A - E||^+||S A - Sg + (A - 2|| A^WPt ± x Pt Hi < \\S- 

A 1 || 00 + - r+||E A -S||i 



,V2 

Finally, we get on the event A > 2||A 1 || 00 that 

||S A - S||| < ||S- S||| + (1+ 8 ^ )2 A 2 rank(S), VS E S p . 

We now prove the spectral norm bound. Note first that the solution of (1.5) 
is given by 

^ A = E K(Sn)-^J ^(S n )® % (S n ), (5.4) 

where x+ = max{0, x} and £„ admits the spectral representation 

S„ = } j a j (t, n )uj(t, n ) 8) ttj(Sn), 
j 

with positive eigenvalues <7j(E n ) > and orthonormal eigenvectors Wj(£„). In- 
deed, the solution of (1.5) is unique since the functional S —> F(S) = ||S„ — S|||+ 
A||S||i is strictly convex. A sufficient condition of minimum is e <9P(E A ) = 
-2(S„ - E A ) + XV with V e <9||S A ||i. We consider the following choice of 

V = E J:CTj (S„)>A/2^(Sn) ® %(S„) + G 0||E A ||l With 
j:a j (£ n )<\/2 

It is easy to check that <9P(£ A ) = -2(E„ - E A ) + AT> = 0. 
Next, we have on the event A > 2||Ai||oo 

||E A - SHoo < ||S A - SftHoo + IIAilU < A. 
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5.2. Proof of Proposition 3 

The delicate part of this proof is to obtain the sharp dependence on 5. As a 
consequence, the proof is significantly more technical as compared to the case 
of full observations 5 = 1. To simplify the understanding of this proof, we 
decomposed it into three lemmas that we prove below. 



PROOF. Define 

4*) = £« - diag (EW) , A« = - diag 

Wc have 



<5" 1 



diag(sW-SW 



Now combining a simple union bound argument with Lemmas 2, 3 and 4, we 
get with probability at least 1 — 4e~* that 



|tr(£W)-to(E)|<<7crHr(£) 



max ■ 



t i 
n ' n 



and 



max max 

i<j<p 



t + logp t + logp 



S 2 i 



5n 



trC^HSlui+MM , tr(S) + t + logn) 



where C > is an absolute constant. 

Noting finally that maxi<j< p (Sjj) < ^/tr(S)]|S||^ A tr(E) and tr(S„) = 
5 -1 tr(En ), we can conclude, up to a rescaling of the absolute constant C > 0, 
that (3.5) and (3.6) hold true simultaneously with probability at least 1 — e - *. 

Lemma 2. Under the assumptions of Proposition 3, we have with probability 
at least 1 — e~* that 



diag (S^ - £ (<5) ) < CV+ 1 max (£#) max ^ 



t + logp t + logp 



(5.5) 



where C > is an absolute constant. 



proof. We have 

diag (£«-£<*) 



= max 
oo i<j<p 



1 " 9 
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Next, since the random variables Sij and X^' are sub-gaussian for any i,j, we 
have 





< 2 


h^ 3) 


i 

< 2 










1p2 





■02 



< 2c x Ejj, 



where we have used Assumption 1 in the last inequality. We can apply Bern- 
stein's inequality (see Proposition 1 in the appendix below) to get for any 
1 < j < P with probability at least 1 — e~* that 



1 11 9 



n ' n 



where C > is an absolute constant. Next, taking t' = t + logp combined with 
a union bound argument we get the result. 

Lemma 3. Under the assumptions of Proposition 3, we have with probability 
at least 1 — 2e _t that 



||^-^ w IU<cS 

Cl 



max < 5 



r(E) (i + log(2p)) 



r(£)(i + log(2p)) 



(c^ + t + logn) Y (5.6) 



where C > is a large enough absolute constant. 



PROOF. We have 

n 

where 

Zi = Y l ®Y l - diag (Yj ® Y») , 1 < i < n. 

Define Y = (<5iX«, . . .,^I<p)) t where 5 X , . . . ,S P are i.i.d. Bernoulli random 
variables with parameter 6 independent from X and Z = Y<£>Y — diag ( Y £g> Y) . 
We want to apply the noncommutative Bernstein inequality for matrices. To this 
end, we need to study the quantities |E(Z — EZ) 2 j|oo and \\Z — EZ||oo- 

We note first that ||E(Z - EZ) 2 ^ < HEZ 2 ^. Next, we set V = Z + 
5diag(X ® X) and TY = <5diag(X ® X). Some easy algebra yields that 

pz 2 !^ < (y pY 2 ^ + VlliE^lloo) 2 . (5.7) 

We now treat pY 2 ||oo and ||EW 2 ||oo separately. Denote by E 5 and E x the 
expectations w.r.t. (8±, ■ ■ ■ ,S P ) and X respectively. We have EY 2 = ExE^Y 2 . 
Next, we have 



l>(Y( fe )) 2 |Y| 2 iik = i, 

\S 3 XWX®\X\Z otherwise. 
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Consequently, we get for any u = (u\, ■ ■ ■ , u p ) T € W with = 1 that 



Eu T V 2 u = 5 2 8E X [\X\ 2 2 (X T u) 2 ] + (1 - S)E X 



<5 2 



.x\X\\ [ 5^E x (X^uY + (1-5)J2 ^E x {X^Yu 2 



(5.8) 



where we have applied Cauchy-Schwarz's inequality. 

We have again by Cauchy-Schwarz's inequality and Assumption 1 that 



3=1 j,k=l:j=£k 



< 



r n 1/2 r 

^E(X W ) 4 + E(I W ) 



1/2 



i=i 



j,k=l:j^k 
2 



2 

j/) 2 



< c\J2\\x&\ 

< Cc{ 2 (tr(£)) 2 , 

for some absolute constant C > 0. We have also, in view of (3.3), with the same 
absolute constant C as above 



y/E x (X,u)* < C\\(X,u)\\ 2 < Ccr'HEHoo, Vue« p : \u\ 2 = 1, 



and 



y/E X (XU))4 < C\\X«% a < CCi%j, l<J<p. 
Combining the three above displays with (5.8), we get 



i^ 2 ||oo < Cq^ 2 tr(E) 

< C Cl " 2 £ 2 tr( s )ll s lloo, 



5||E|| 00 + (1-<J) max(E^) 
i<j<p 



(5.9) 



and 



|lW 2 ||oo = S 2 max E x (x&) < Cc{ 2 8 2 max (S 2 ,) < C C] ; 2 <5 2 tr(5])||5]|| c 
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Combining the two above displays with (5.7), we get 

HEZ'Hoo < Ccr 2 <5 2 tr(£)l|£||oc, 

where C > is an absolute constant. 
Next, we treat \\Z - EZ||oo. We have 

Wz-ezw^ < iiziioc + IIEziioc < ly^ + ^nsiu, 

where we have used that ||-£||oo < ||^ ®^||oo = an d 

HEZIU = - diagCsW)^ = 5 2 ||S - diag^Hoo < <S 2 ||£||oc. 
In view of Assumption 1, we have 

IML ^ E H^-C^^) 2 !! < 2 it\\ X(j) t 2 ^ 2^ 1 tr(E). 

3=1 1 3=1 2 

Then, combining Proposition 1 with a union bound argument gives for any 
t > 

P ( max \Yi\l < tr(S) (<5 + Cc[ 1 (t + logn)) ) > 1 - e - *, 

\ l<i<n y 

where C > is an absolute constant. 
Define 

U = tr(S) (C _1 ci<5 + i + logn) , 

and 



^nua jvMSJIlEHoo^^^S., tr(E) M + t + logn) 



ti = C'c 



where C" > is a large enough absolute constant. 
We have 



t + log(2p) 



A^IU > ii) < P |{||4 5) - A^IU > t x } n f| < 

/ n > 

+p U W >E/ ) 



Vi=l 



< P ||4*) -A^IU > tx | f| {l^l 2 < U} +e 



< 2e - *, 



where we have used Proposition 2 to get that 

- ^ (<5) ||oo > ii} I n{M2<tf}) <e" 



1=1 
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Lemma 4. Under the assumptions of Proposition 3, we have with probability 
at least 1 — e - * that 



|tr(sW) - <5tr(E)| < Cc^trfE) max ^ , 



(5.10) 



where C > is an absolute constant. 



PROOF. In view of Assumption 1, we have for any 1 < j < p that ||(y^) 2 ||vM < 
||(X«) 2 || v , 1 <2||XW||2 2 <2cr 1 E,,-and 



milU<EF ( ^ 



< 



(A^) 2 <2VP <2c ] ; 1 tr(E). 

II ^2 

J = l 



Next, we have 

tr(E^)-5tr(E) = tr(E^ - E^ } ) 



tr f - X l -E(X®X) | 

i ™ 

- V tr(y <g> y) - e (tr(y ® y)) 

n ' 

t=l 

n 

i£|y| 2 -E|y| 2 . 



i = l 



Next, we have 



Ml 



l^lilUi < IIMIlIk <2c r 1 tr(E). 
Then we can apply Proposition 1 to get the result. 



5. 3. Proof of Lemma 1 

In view of Proposition 3, we have on an event A of probability at least 1 — ^ 
that 



Elloo < C [ - 



Cl 



r(E)log2p 



5h 



(5.11) 



We assume further that (3.7) is satisfied with a sufficiently large constant c so 
that we have, in view of (3.5) and (3.6), on the same event A that 



and 



EIU < 



|tr(E n )-tr(E)| < 



2 

tr(E) 
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We immediately get on the event A that 

1 3 

and 

itr(E) < tr(S n ) < |tr(E). 
Combining these simple facts with (5.11), we get the result. 

5-4- Proof of Theorem 2 

This proof uses standard tools of the minimax theory (cf. for instance [35]). 
However, as for Proposition 3, the proof with missing observations (8 < 1) is 
significantly more technical as compared to case of full observations (S = 1). 
In particular, the control of the Kullback-Leibler divergence requires a precise 
description of the conditional distributions of the random variables Y\ , . . . , Y n 
given the masked variables Si, . . . , 5 n . To our knowledge, there exists no mini- 
max lower bound result for statistical problem with missing observations in the 
literature. 



PROOF. 

Set 7 = a/ v 7 5 2 n where a > is a sufficiently small absolute constant. 
We consider first the case r > 2. Define 

N = {E k ,i + Et, k , l<k<r-l,k + l<l<r}. 

Set Bk.i = Ek.i + Ei t k for any 1 < k < r — 1, k + 1 < I < r. Consider the 
associated set of symmetric matrices 



E f = 



_ ( Ir +TELi SLfc+1 e k,lBk,l 



o 



o 



o 



(cfc,i)fc,j e {0,1}^ 



Note that any matrix E e G B(Af) is positive-semidefinite if < a < 1 since we 
have by assumption 



r— 1 r 
k=l l=k+l 



<ir = a\l—<a. 



(5.12) 



By construction, any element of B(N) as well as the difference of any two 
elements of B(AT) is of rank exactly r. Consequently, B(N~) C C r since r(E e ) < 
rank(E e ) < r for any E e g B{N). Note also that for any E £ G B(Af), we have 
tr(E e ) = r and < 1 — a < ||E e ||oo < 1+a provided that < a < 1 and 
consequently r/ (1 + a) < r(E e ) < r/(l — a). Indeed, we have 



||S 6 ||oo < 1 + 7 



fe=i 



< 1 + 7r < 1 + a, 
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in view of the condition n > 5 2 r 2 . A similar reasoning gives the lower bound. 

Denote by Aq the px p block matrix with first block equal to I r . Varshamov- 
Gilbcrt's bound (cf. Lemma 2.9 in [35]) guarantees the existence of a subset 
A" C B(Af) with cardinality Card(„4°) > 2 r ( r - 1 )/ 16 + 1 containing A Q and such 
that, for any two distinct elements S e and S c / of A a , we have 

„^ „ „o of(r — l) r 2 a 2 r 2 

^ WT^II^II-ll 1 ^^' ^ e ^ (5 - 13) 

Let Xl, . . . , X n € R p be i.i.d. N (0, S e ) with £ e e .4°. For the sake of brevity 
we set £ = £ e . Recall that 5%, . . . ,S n are random vectors in R p whose entries Sij 
are i.i.d. Bernoulli entries with parameter S independent from (Xi, • ■ ■ , X n ) and 
that the observations Y±,...,Y n satisfy = SijX^ . Denote by Pe the distri- 
bution of (Yi, • • • , Y n ) and by Pj? the conditional distribution of (Yi, • • • , Y n ) 
given (<5i, • • • , 5 n ). Next, we note that, for any 1 < i < n, the conditional random 
variables Yj | Si are independent Gaussian vectors N(0,T,^ Si ^) where 

I diji-ijj otherwise. 

Thus, wc have fL' = ®JL]Pj;(«<)- Denote respectively by P5 and E5 the 
probability distribution of (Si,- - ■ ,S n ) and the associated expectation, and by 
Kg i the expectation w.r.t Si for any 1 < i < n. We also denote by Eg and 
E^P the expectation and conditional expectation associated respectively with 
P E and P^ } . 

Next, the Kullback-Leibler divergences K (P^P^) between Fa and P E sat- 
isfies 



,'dP A \ (d(F s ®F ( P)\ r^i / - 71P>(<5) 

A' (P Ao , P S ) = E Ao log ( ^ ) = E Ao log ( 1* ] = E,E<£ log 




(5.15) 



Using that Y t \ S t ~ A(0,E (,5l) ) with £( 5 *) defined in (5.14), wc get for any 
1 < i < n, any S € .4° and any realization 5i(w) S {0, 1} P that 

1. P E ( 5i ("» < P 4 («iM) and hence K(F , P S (M-))) < 00. 

2. P^csiCw)) and P ^re supported on a dj(a;)-dimensional subspace of 
W where d, t = ^f. =1 S id ~ Bin(r, 5). 



Define J; = {j : 5 ij-7 - = 1, 1 < j < r}. Define the mapping P 4 : RP -> R d - as 
follows Pi(x) = xj 4 where for any a; = (x (1 V • • ,x^) T G R p , x.j l e R d * is 
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obtained by keeping only the components x^ with their index k € J{. We 
denote by P* : R d ' -> R p the right inverse of P % . 

We note that P t Af l] P* = I di and 

r— 1 r 

PiUMpr =i di + 7 ^2 E zk,iPiB k ,iP*-$. keJ ^ leJi 

k=l l=k+l 

= Id, + w t . 

Thus we get that 

A (P^ , , P E(<4) ) = A' (P /di , F Ich +Wt ) 

= itr (I di + Wi)-\ log (det (7 di + W l )) - |. 

Denote by Ai, . . . , A^ the eigenvalues of Wj. Note that |Aj| < 1/2 for any j = 
1, . . . , di in view of (5.12) if a < 1/2. We get, using the inequality x — \og(l + x) < 
x for any x > —1/2, that 



A(P At5i ),P s(5i3 ) < 



j'=i 

r— 1 r 

Y 2 

2 

<7 2 (rf-"^)- (5.16) 



r — 1 r 

^ 2 E E ll^iullpfeej^ 



fe=i ;=fe+i 



Taking the expectation w.r.t. to Si in the above display, we get for any 1 < i < n 
that 

E Si K (p^o.Pec**)) < 7 2 E 5l (d 2 - *) < 7 2 5 2 r(r - 1), 
since <ii ~ Bin(r, 5). Combining the above display with (5.15), we get 

A(P Ao ,P s ) < n-y 2 8 2 r{r - 1) = na 2 -!—5 2 r{r - 1) < a 2 r(r - 1). 
Thus, we deduce from the above display that the condition 

Cardf L_ ! E ^( P a ,P S ) < clog (Card(^ ) - l) (5.17) 

is satisfied for any a > if a > is chosen as a sufficiently small numerical 
constant depending on a. In view of (5.13) and (5.17), (4.1) now follows by 
application of Theorem 2.5 in [35]. 
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The lower bound (4.2) follows from (4.1) by the following simple argument. 
Consider the set of matrices A . For any two distinct matrices Si, S2 of A , we 
have 



|Si - S 



2 || x.. 



> 



' (1 - a)a 2 
y 16(1 + a) 2 

Indeed, if (5.18) does not hold, we get 

,2 



r(S 6 -) 
5 2 n 



VS e ~ G A . 



(5.18) 



Fi-*\\l< (1 - a)a 



16(1 + a) 2 



^IlL^"**^). VS g G A°, 



since rank(Si — S2) < r by construction of .4°. This contradicts (5.13). 

Next, (5.17) is satisfied for any a > if a > is chosen as a sufficiently small 
numerical constant depending on a. 

Combining (5.18) with (5.17) and Theorem 2.5 in [35] gives the result. 

The case r = 1 can be treated similarly and is actually easier. Indeed if 
r(S) = 1, then we have tr(S) = ||S||oo and rank(S) = 1. Consequently, we can 
derive the lower bound by testing between the two hypothesis 



£0 = 















and Ei 



1 + 7 





O 






where So and Si are pxp covariance matrices with only one nonzero component 
on the first diagonal entry. For these covariance matrices, we have tr(So) = 
HEolloo = 1 and tr(Si) = |jSi||oo = 1 + 7 < 2. Thus we have 



Si 



I 2 

I OO 



Si||2> — >c||S 
o z n 



,2 £(Sj) 



0,1 



for some absolute constant c > 0. The rest of the proof is identical to the case 
r > 2. 
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